Red wine is one of the most beautiful drinks, so it’s going to be interesting to find out what makes a good wine! :)


Contents

  1. Data set
  2. Exploring data
  3. Univariate plots
  4. Analyzing quality
        4.1 Correlations with other variables
        4.2 Boxplot graphs with other variables
  5. What chemical properties influence the quality
        5.1 Chemical properties correlation table
        5.2 Chemical properties bivariate plots
  6. Building linear regression model
  7. Summary and final plots
  8. Reflection
  9. References
  10. Author and contact information

Data set

The data can be downloaded from this link; you can also find it on my GitHub, along with the other report resources: link.

Also read this text file which describes the variables and how the data was collected.

The data set contains 11 chemical characteristics plus a quality score from 0 to 10, given by at least 3 wine experts, for 1599 different wines!


Exploring data

wine <- read.csv('./data/wineQualityReds.csv')

The data has 1599 observations of 13 variables.

The type of data in each column is as follows:

str(wine)
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The units of each column are:

Input variables (based on physicochemical tests):
1. fixed acidity (tartaric acid - g / dm^3)
2. volatile acidity (acetic acid - g / dm^3)
3. citric acid (g / dm^3)
4. residual sugar (g / dm^3)
5. chlorides (sodium chloride - g / dm^3)
6. free sulfur dioxide (mg / dm^3)
7. total sulfur dioxide (mg / dm^3)
8. density (g / cm^3)
9. pH
10. sulphates (potassium sulphate - g / dm^3)
11. alcohol (% by volume)
Output variable (based on sensory data):
12. quality (score between 0 and 10)


Univariate plots

Let’s look closer at each variable on its own. These density plots show the distribution of each variable.

  • The red lines represent the 25% and 75% quantiles (i.e. 25% of the data lies to the left of the first line), and the blue one represents the 50% quantile (the median).
  • The grey vertical line represents the mean (average).
  • The two dark magenta lines represent the 10% and 90% quantiles (i.e. 80% of the data lies between them).
  • And the table contains some descriptive statistics.
fixed.acidity
## 80% of the records lies in between 6.5 and 10.7 which is 37.2% of the graph.
## 50% of the records lies in between 7.1 and 9.2 which is 18.6% of the graph.

volatile.acidity
## 80% of the records lies in between 0.31 and 0.745 which is 29.8% of the graph.
## 50% of the records lies in between 0.39 and 0.64 which is 17.1% of the graph.

citric.acid
## 80% of the records lies in between 0.01 and 0.522 which is 51.2% of the graph.
## 50% of the records lies in between 0.09 and 0.42 which is 33% of the graph.

residual.sugar
## 80% of the records lies in between 1.7 and 3.6 which is 13% of the graph.
## 50% of the records lies in between 1.9 and 2.6 which is 4.8% of the graph.

chlorides
## 80% of the records lies in between 0.06 and 0.109 which is 8.2% of the graph.
## 50% of the records lies in between 0.07 and 0.09 which is 3.3% of the graph.

free.sulfur.dioxide
## 80% of the records lies in between 5 and 31 which is 36.6% of the graph.
## 50% of the records lies in between 7 and 21 which is 19.7% of the graph.

total.sulfur.dioxide
## 80% of the records lies in between 14 and 93.2 which is 28% of the graph.
## 50% of the records lies in between 22 and 62 which is 14.1% of the graph.

density
## 80% of the records lies in between 0.994556 and 0.99914 which is 33.7% of the graph.
## 50% of the records lies in between 0.9956 and 0.997835 which is 16.4% of the graph.

pH
## 80% of the records lies in between 3.12 and 3.51 which is 30.7% of the graph.
## 50% of the records lies in between 3.21 and 3.4 which is 15% of the graph.

sulphates
## 80% of the records lies in between 0.5 and 0.85 which is 21% of the graph.
## 50% of the records lies in between 0.55 and 0.73 which is 10.8% of the graph.

alcohol
## 80% of the records lies in between 9.3 and 12 which is 41.5% of the graph.
## 50% of the records lies in between 9.5 and 11.1 which is 24.6% of the graph.

quality
## 80% of the records lies in between 5 and 7 which is 40% of the graph.
## 50% of the records lies in between 5 and 6 which is 20% of the graph.
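The code that printed these lines is not shown in the report, so here is a minimal sketch of how such figures could be computed. It assumes that the "% of the graph" value is the width of the quantile interval divided by the full range of the variable, which agrees with the quality line above ((7 - 5) / (8 - 3) = 40%). The helper name `central_share` and the toy vector are made up for illustration.

```r
# Hypothetical helper (not from the original report): bounds of the central
# `level` fraction of the records, and the share of the value range they cover.
central_share <- function(x, level = 0.8) {
  tail_p <- (1 - level) / 2
  q <- quantile(x, probs = c(tail_p, 1 - tail_p), names = FALSE)
  list(
    lower = q[1],
    upper = q[2],
    share = 100 * (q[2] - q[1]) / diff(range(x))  # percent of the full range
  )
}

# Toy vector standing in for one column of `wine`
x <- 0:100
res80 <- central_share(x, level = 0.8)  # 10% and 90% quantiles
res50 <- central_share(x, level = 0.5)  # 25% and 75% quantiles
```

With the real data, a call such as `central_share(wine$alcohol, 0.8)` would give the bounds and share for one column.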

Analyzing quality

Let’s focus on quality.
Although quality is supposed to range from 0 to 10, all records fall between 3 and 8. The number of wines at each level is as follows:

table(wine$quality)
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

82.5% of the wines have a quality of either 5 or 6.
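As a quick check of that percentage, the counts from the table above can be turned into shares. This is only a sketch: the counts are typed in by hand here, whereas with the data loaded one would start from `table(wine$quality)`.

```r
# Counts per quality level, copied from the table(wine$quality) output above
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of each level, in percent
share <- round(100 * counts / sum(counts), 1)

# Combined share of quality 5 and 6
share_5_6 <- round(100 * sum(counts[c("5", "6")]) / sum(counts), 1)
```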

Correlations with other variables

Let’s zoom into the correlation between quality and the chemical characteristics:

variable                Pearson corr
fixed.acidity            0.12
volatile.acidity        -0.39
citric.acid              0.23
residual.sugar           0.01
chlorides               -0.13
free.sulfur.dioxide     -0.05
total.sulfur.dioxide    -0.19
density                 -0.17
pH                      -0.06
sulphates                0.25
alcohol                  0.48

As we can see, the only relatively good correlation is with the alcohol percentage (0.48).
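A table like the one above can be produced with `cor()`. The sketch below uses a tiny made-up data frame (`toy`) instead of the real `wine` data, so its numbers are not the ones in the table; only the column names are borrowed from the data set.

```r
# Toy stand-in for the wine data (values are illustrative only)
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.50, 0.40, 0.30),
  quality          = c(5, 5, 5, 6, 7, 7)
)

# Pearson correlation of every other column with quality
predictors <- setdiff(names(toy), "quality")
pearson <- sapply(toy[predictors], cor, y = toy$quality, method = "pearson")

# Sorted from the most positive to the most negative correlation
pearson_sorted <- round(sort(pearson, decreasing = TRUE), 2)
```

On the real data, the `X` row-index column would have to be dropped first, since it is not a chemical property.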

Boxplot graphs with other variables

Another way to see the relations is by drawing boxplots.
The following graphs show one boxplot per quality level [3-8] for each chemical.

  • The two magenta lines represent the 10% and 90% quantiles.
  • The red line represents the median [50%].
  • The black points inside the boxplots, and the line connecting them, represent the mean for each quality level.

Fixed acidity: the mean increases from level 4 to 7.

Volatile acidity: the mean decreases from level 3 to 7, and increases a little at 8.

Citric acid: the mean stays the same from 3 to 4, then increases up to 7, then stays the same at 8.

Residual sugar: the mean slightly decreases from 3 to 8.

Chlorides: the mean decreases significantly from 3 to 4, then decreases slowly all the way to 8.

Free sulfur dioxide: the mean increases from 3 to 5, then decreases from 5 to 8.

Total sulfur dioxide: the same as free sulfur dioxide, the mean increases from 3 to 5, then decreases from 5 to 8.

Density: the mean decreases from 3 to 4 and from 5 to 8, but increases from 4 to 5.

pH: the mean stays the same from 3 to 4 and from 5 to 6, and decreases otherwise.

Sulphates: the mean slowly increases all the way.

Alcohol: the mean increases significantly from 5 to 8, and from 3 to 4, but decreases from 4 to 5.
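The per-level means described above (the black points in the boxplots) can be computed with `tapply()`. This is a sketch on a made-up data frame; on the real data the same call would be `tapply(wine$alcohol, wine$quality, mean)`.

```r
# Toy stand-in for two columns of the wine data (values are illustrative only)
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.0, 11.0, 11.5, 12.5)
)

# Mean alcohol for each quality level
mean_by_quality <- tapply(toy$alcohol, toy$quality, mean)

# Change of the mean from one quality level to the next
step_change <- diff(mean_by_quality)
```

A positive entry in `step_change` means the mean rises between two neighbouring levels, a negative one means it falls.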


So why are we doing this? Let’s remember what we are looking for: relations between quality and the chemical properties.
The correlations showed a relation with alcohol only, not with the others.
But the boxplots showed many increases and decreases across quality levels, and they showed that the relation between quality and alcohol isn’t perfectly positive.

That leads us to the question in the next part.


What chemical properties influence the quality

Which chemical characteristics influence the quality, or is there any relation between them at all?

Logic says yes, but the correlations say no except for alcohol, and the boxplots show some relations.

Let’s think in a different way: instead of searching for a direct relation between each characteristic and quality, let’s find the relations among the chemical characteristics themselves.

Chemical properties correlation table

The below correlation table is a good way to find these relations.

The correlations are computed using both the Pearson and Spearman methods, so each element in the table is structured as:     Pearson’s / Spearman’s.

Correlations bigger than .3 or less than -.3 are colored in red.

variable | fixed.acidity | volatile.acidity | citric.acid | residual.sugar | chlorides | free.sulfur.dioxide | total.sulfur.dioxide | density | pH | sulphates | alcohol
fixed.acidity | 1
volatile.acidity | -0.26 / -0.28 | 1
citric.acid | 0.67 / 0.66 | -0.55 / -0.61 | 1
residual.sugar | 0.11 / 0.22 | 0 / 0.03 | 0.14 / 0.18 | 1
chlorides | 0.09 / 0.25 | 0.06 / 0.16 | 0.2 / 0.11 | 0.06 / 0.21 | 1
free.sulfur.dioxide | -0.15 / -0.18 | -0.01 / 0.02 | -0.06 / -0.08 | 0.19 / 0.07 | 0.01 / 0 | 1
total.sulfur.dioxide | -0.11 / -0.09 | 0.08 / 0.09 | 0.04 / 0.01 | 0.2 / 0.15 | 0.05 / 0.13 | 0.67 / 0.79 | 1
density | 0.67 / 0.62 | 0.02 / 0.03 | 0.36 / 0.35 | 0.36 / 0.42 | 0.2 / 0.41 | -0.02 / -0.04 | 0.07 / 0.13 | 1
pH | -0.68 / -0.71 | 0.23 / 0.23 | -0.54 / -0.55 | -0.09 / -0.09 | -0.27 / -0.23 | 0.07 / 0.12 | -0.07 / -0.01 | -0.34 / -0.31 | 1
sulphates | 0.18 / 0.21 | -0.26 / -0.33 | 0.31 / 0.33 | 0.01 / 0.04 | 0.37 / 0.02 | 0.05 / 0.05 | 0.04 / 0 | 0.15 / 0.16 | -0.2 / -0.08 | 1
alcohol | -0.06 / -0.07 | -0.2 / -0.22 | 0.11 / 0.1 | 0.04 / 0.12 | -0.22 / -0.28 | -0.07 / -0.08 | -0.21 / -0.26 | -0.5 / -0.46 | 0.21 / 0.18 | 0.09 / 0.21 | 1

From the above table we can conclude the following:

fixed acidity is correlated to citric acid, density and pH.
volatile acidity is correlated to citric acid and sulphates.
citric acid is correlated to volatile acidity, fixed acidity, pH and sulphates.
chlorides is correlated to density and sulphates.
density is correlated to fixed acidity, alcohol, residual sugar and chlorides.
pH is correlated to fixed acidity and citric acid.
sulphates is correlated to volatile acidity, citric acid and chlorides.
residual sugar is correlated to density.
alcohol is correlated to density.

The table also shows correlations above the threshold between density and citric acid, between density and pH, and between free and total sulfur dioxide; these pairs are not carried into the regression model further down.
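A "Pearson / Spearman" table of this kind could be built roughly as follows. This is a sketch on a small made-up data frame (`toy`), not the code used for the report; the 0.3 cut-off is the one stated above.

```r
# Toy stand-in for a few chemical columns (values are illustrative only)
toy <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 7.3, 7.5),
  citric.acid   = c(0.00, 0.04, 0.56, 0.06, 0.00, 0.36),
  pH            = c(3.51, 3.26, 3.16, 3.30, 3.39, 3.35)
)

pearson  <- cor(toy, method = "pearson")
spearman <- cor(toy, method = "spearman")

# One text cell per pair, formatted as "Pearson / Spearman"
cells <- matrix(
  paste(round(pearson, 2), "/", round(spearman, 2)),
  nrow = nrow(pearson),
  dimnames = dimnames(pearson)
)
cells[upper.tri(cells)] <- ""  # keep only the lower triangle
diag(cells) <- "1"

# Pairs whose Pearson correlation is beyond +/- 0.3
strong <- which(abs(pearson) > 0.3 & lower.tri(pearson), arr.ind = TRUE)
```

Printing `cells` with `print(cells, quote = FALSE)` gives a lower-triangular table in the same layout as the one above.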


And from that we get this tree:

So we have 7 parent nodes that have children:
  Quality, Alcohol, Density, Fixed Acidity, Chlorides, Citric acid and Volatile acidity.

And all of them depend on each other: as we know, alcohol affects quality, alcohol is affected by density, which is affected by other chemicals, which are affected by others, and so on.

Taking the negative and positive correlations into account, the quality value increases when the following happen:

[Figure: one arrow per variable showing its direction of change, arranged by tree level from bottom to top]
  volatile acidity, pH, sulphates
  citric acid, pH, sulphates
  fixed acidity, residual sugar, chlorides
  density
  alcohol
  quality


Let’s go back to our question: WHAT CHEMICAL PROPERTIES INFLUENCE THE QUALITY?

To answer that we must go through the earlier tree level by level.

Chemical properties bivariate plots

The plots below explain that: the first plot has quality as Y (the dependent variable), then the next variable in the tree becomes the Y of the next plot, and so on.

  • The black line represents the line of best fit (linear model).
  • The red line represents the mean.
  • The two blue lines represent the first and third quartiles (50% of the points lie between them).
  • And the orange points are the data.

Let’s start with the top element [Quality].

Quality is positively correlated with alcohol; there are a few drop-off points above and below the fitted line. Let’s look at alcohol.

The mean and quantile lines go up and down, but there is still a relation. Let’s have a look at density.

Density depends on three variables (fixed acidity, chlorides and residual sugar), and its mean line matches the line of best fit.

As shown, fixed acidity has a positive relation with citric acid and a negative one with pH.

Also, citric acid has a positive relation with sulphates and a negative relation with both pH and volatile acidity.


Building linear regression model

Now that we have seen relations between quality and the chemical properties, let’s build a regression model so that in future, given the chemical properties of some wine, we can predict its quality.

Simple linear regression uses one independent variable to predict the outcome of a dependent variable; here we use several independent variables together with their interactions.

We will use the formula Y ~ X, where X represents the relations shown above in the tree.
Because the variables are on different scales, it is nicer to bring them all to the same scale. I’ll choose a scale from 0 to 10, so every element in each variable will have a value from 0 to 10 while keeping the relative statistics of each variable unchanged.
The rescaled data is stored in a new variable called ‘wine.ratio’.
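The rescaling code itself is not shown in the report. One common way to map every column onto 0 to 10 is min-max scaling, sketched below; this is an assumption about the method, and `rescale_0_10`, `toy` and `toy.ratio` are made-up names.

```r
# Hypothetical helper: linear min-max rescaling of a numeric vector to [0, 10]
rescale_0_10 <- function(x) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  10 * (x - lo) / (hi - lo)
}

# Toy stand-in for two columns of the wine data (values are illustrative only)
toy <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 12.0),
  pH      = c(3.51, 3.20, 3.26, 3.16)
)

# Apply the rescaling column by column
toy.ratio <- as.data.frame(lapply(toy, rescale_0_10))

# A linear rescaling leaves correlations between columns unchanged
cor_before <- cor(toy$alcohol, toy$pH)
cor_after  <- cor(toy.ratio$alcohol, toy.ratio$pH)
```

Because the transformation is linear with a positive slope, the order of the values and the correlations between columns stay the same; only the units change.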

Now let’s look at the model:

# In an R formula, a * b expands to a + b + a:b (both main effects plus their interaction)
reg_lm <- lm(quality ~
               alcohol * density +
               density * fixed.acidity +
               density * residual.sugar +
               density * chlorides +
               chlorides * sulphates +
               fixed.acidity * pH +
               fixed.acidity * citric.acid +
               citric.acid * pH +
               citric.acid * volatile.acidity +
               citric.acid * sulphates +
               volatile.acidity * sulphates,
             data = wine.ratio)

Slopes:

variable                          slope
alcohol                            0.205***
density                           -0.012
fixed.acidity                     -0.013
residual.sugar                    -0.081
chlorides                          0.133
sulphates                          0.217**
pH                                -0.028
citric.acid                        0.110
volatile.acidity                  -0.202***
alcohol x density                 -0.003
density x fixed.acidity           -0.006
density x residual.sugar           0.015
density x chlorides               -0.020
chlorides x sulphates             -0.036*
fixed.acidity x pH                 0.031*
fixed.acidity x citric.acid       -0.006
pH x citric.acid                  -0.036**
citric.acid x volatile.acidity     0.015
sulphates x citric.acid           -0.001
sulphates x volatile.acidity      -0.004

Intercept and some statistics:

type              value
Intercept         5.230***
R-squared         0.362
adj. R-squared    0.354
sigma             0.649
F                 44.847
p                 0.000
Log-likelihood    -1566.812
Deviance          664.474
AIC               3177.623
BIC               3295.920
N                 1599

  • This model explains about 36% of the variance in quality (R-squared value).
  • About 95% of the observations should fall within +/- 1.298 quality points of the fitted line (2 x sigma = 2 x 0.649).
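To make the two numbers in the bullets concrete, here is a small sketch that fits a model with one interaction on simulated data and reads off R-squared and sigma. The data are random stand-ins generated with `set.seed(42)`, not the wine data, and the "2 x sigma" band is the usual rule of thumb that assumes roughly normal residuals.

```r
set.seed(42)  # simulated stand-in data, not the wine data
n   <- 200
toy <- data.frame(alcohol = runif(n, 0, 10), density = runif(n, 0, 10))
toy$quality <- 5 + 0.2 * toy$alcohol - 0.05 * toy$density + rnorm(n, sd = 0.6)

# alcohol * density expands to alcohol + density + alcohol:density
fit <- lm(quality ~ alcohol * density, data = toy)
s   <- summary(fit)

term_names <- names(coef(fit))  # intercept, two main effects, one interaction
r_squared  <- s$r.squared       # share of variance explained by the model
sigma_hat  <- s$sigma           # residual standard error
band_95    <- 2 * sigma_hat     # rough +/- band holding about 95% of residuals
```

Applied to the model above, the same calculation gives 2 x 0.649 = 1.298 quality points, which is wide compared with the 3 to 8 range of the observed scores.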

Let’s visualize our model.
The first graph shows boxplots for the formula Y ~ X, with Y (wine quality) as a factor on the x-axis and X (the relations between the chemical properties shown above) on the y-axis.
I’ll use the new data set, wine.ratio, here.

As shown above, the mean of X gets higher as quality gets higher for quality levels 3, 5, 6 and 7. The exceptions are 4 and 8: the mean of X at quality 4 is lower than the mean at 3, and the mean at quality 8 is lower than the mean at quality 7.
But we can still say that as quality increases, X increases.


The second one shows the difference between the actual quality and the quality predicted by the regression model. Let’s first make a new column in the data called “quality.predicted”, which holds the predictions of the regression model rounded to whole quality levels.

wine$quality.predicted <- round( predict(reg_lm, wine.ratio ) )

Now let’s plot it:

The bars show the number of wines at each quality level.
The red ones are for the actual quality, and the blue ones are for the predicted quality.
Most of the predicted qualities are 5 and 6, with a few 7s.
The model never predicts quality 3, 4 or 8; instead it predicts 5 and 6 more often than they actually occur.
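The counts behind such a bar chart can be tabulated as below. This is a sketch with made-up numbers: `fitted` stands in for the output of `predict(reg_lm, wine.ratio)`, and fixing the factor levels to 3 to 8 keeps levels that are never predicted in the table with a count of zero.

```r
# Toy stand-ins: actual scores and hypothetical raw model predictions
actual    <- c(3, 4, 5, 5, 6, 6, 7, 8)
fitted    <- c(5.2, 5.1, 5.3, 5.4, 5.8, 6.2, 6.6, 6.4)
predicted <- round(fitted)  # same rounding step as in the report

# Count wines per quality level, keeping empty levels as zeros
levels_q     <- 3:8
count_levels <- function(x) as.integer(table(factor(x, levels = levels_q)))

counts <- rbind(
  actual    = count_levels(actual),
  predicted = count_levels(predicted)
)
colnames(counts) <- as.character(levels_q)
```

In this toy example the predictions pile up on 5 and 6, which is the same pattern the real model shows: a regression line fitted mostly to 5s and 6s rarely reaches the extreme levels.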


Summary and final plots

We started by wondering about the relation between the quality of a wine and its chemical properties. It seems clear that there must be a relation, but although we found some weak relations, they remain weak and we can’t count on them.

So how does this make sense? If we trust that the chemical tests are correct and there is no error in the data, then the error is in the human factor! Let’s not forget that the quality scores are entered by humans, and humans always make mistakes!

So I believe that, to some degree, many quality values reflect personal preference rather than actual high quality.


I chose three plots to summarize the analysis we did:

The first one is the boxplot graph of alcohol against quality, which shows how quality is affected by alcohol percentage.
The highest quality level has a mean near 12% alcohol, and the lowest quality level has a mean near 10% alcohol.
Even if the 2% difference does not seem big, the graph shows that it matters: as the alcohol level goes from 10 to 12, the quality level goes higher.


The second one is the graph that shows the difference between the actual quality and the quality predicted by the regression model.
The model predicts quality 5 and 6 much more often than the other levels.


The third graph is the relation between X and Y in the linear regression model.
What is interesting is that it shows the relation between the quality and all the chemical properties in one understandable graph. It also suggests that quality increases with the model’s X, although the effect is tiny.


Reflection

In this analysis I could find relations between quality and the chemical properties, but some of them aren’t easy to find, as shown in the boxplot graphs. Also, I couldn’t be sure that a relation is real and not just a coincidence.
I struggled to find the best-fitting formula for the linear model. I tried my best, and I hope the one I chose is good enough.

So how do we really get the relation between a wine’s quality and its chemical properties?

I don’t believe that diving deeper into this data set would give me the answer. To get the answer we have to find a better data set for it; maybe that data would contain prices, brands, and more accurate quality scores or drinkers’ reviews.

Also, chemical properties aren’t everything that matters in wine. There is still the type of grape used, the quality of the wine brand, any flavors added, and other things that haven’t been considered in the data set.

Another thing: the fact that most of the quality values are 5 or 6 makes the data harder to analyze. There are no very good wines (of quality 9 or 10) or very bad wines (of quality 0, 1 or 2), which also confirms that the data isn’t strong enough to rely on and, as I said, humans make mistakes.


References

The data-set used in this report :

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at:
Elsevier
Pre-press (pdf)
bib


Author and contact information

This analysis was done by a student of Udacity’s Data-analysis Nano-degree program as a course project.

Github: https://github.com/bekaa
LinkedIn: https://eg.linkedin.com/in/khaled-salah-48360590
WordPress: https://khaledsalahblog.wordpress.com
Email: sci.kd.eg@gmail.com